The actinopterygians comprise nearly one-half of all extant vertebrate species and are very important for 
human well-being. However, the phylogenetic relationships among certain groups within the 
actinopterygians are still uncertain, and debates about these relationships have continued for a long time. 
Along with the progress achieved in sequencing technologies, phylogenetic analyses based on multi-gene 
sequences, termed phylogenomic approaches, are becoming increasingly common and often result in 
well-resolved and highly supported phylogenetic hypotheses. Based on the transcriptome sequences 
generated in this study and the extensive expression data currently available from public databases, we 
obtained alignments of 274 orthologue groups for 26 scientifically and commercially important 
actinopterygians, representing 17 out of 44 orders within the class Actinopterygii. Using these alignments 
and probabilistic methods, we recovered relationships between basal actinopterygians and teleosts, among 
teleosts within protacanthopterygians and related lineages, and also within acanthomorphs. These 
relationships were recovered with high confidence. 

The actinopterygians (ray-finned fish) comprise approximately 28,000 extant species. This group is one of the
major vertebrate groups, including nearly half of all extant vertebrate species 1. Currently, according to
molecular, morphological and paleontological studies, the actinopterygians, including 44 orders and 453
families 1, are interpreted as a taxon comprising four major groups: cladistians, chondrosteans, holosteans and
teleosteans 2-4. Considerable effort has been made over a long time to resolve the phylogeny of actinopterygians
based on both morphological and molecular data. However, the phylogenetic relationships among the major
groups of actinopterygians remain controversial and unresolved, as are many of the proposed higher-level taxa
within the Teleostei (e.g., 5,6). Debates on the ordinal relationships among basal euteleosts, and on the most
species-rich lineage, the Acanthomorpha, have long continued, although several new findings in molecular
biology agree with results derived from morphological studies 7-9. One of the major questions in actinopterygian
phylogeny is the pattern of phylogenetic relationships among the higher "perch-like" fish, the order Perciformes
and relatives (e.g., 6,10,11). The monophyly of certain orders and families is in doubt, and this difficulty creates even
greater problems 1.

Previous studies of actinopterygian phylogenies on the basis of nuclear genes focused primarily on particular 
groups and/or were usually based on relatively few markers. Even within the same species group, different gene 
markers have resulted in conflicting phylogenies in certain cases. For example, Mason-Gamer and Kellogg 
found that gene trees of the grass tribe Triticeae resulting from four different single-gene data sets disagreed 
extensively in their intergeneric relationships 12 . Another study using four nuclear and two mitochondrial loci 
individually obtained different phylogenies among 17 Oriental Drosophila melanogaster species 13 . Rokas et al. 
selected 106 widely distributed orthologous genes from eight yeast genome sequences and concluded that a single 
or a small number of concatenated genes had a significant probability of supporting conflicting topologies, 
whereas more than 20 genes combined might yield a single, fully resolved species tree with maximum support 14 . 
Nevertheless, increasing the number of genes for accurate phylogenetic inferences inevitably constrains
the number of analysed taxa and increases the percentage of missing data because of many limitations,
such as time and resources. Furthermore, based on the aforementioned datasets used by Rokas et al.,
Phillips et al. obtained 100% supported but mutually incongruent trees using different
tree-reconstruction methods and suggested that this inconsistency resulted from a compositional
bias 15. Despite these caveats, phylogenomic approaches in systematics based on the analysis of
multi-gene sequence data are becoming increasingly common because large numbers of characters and
independent evidence from many genetic loci often result in well-resolved and highly supported
phylogenetic hypotheses 14-16. Furthermore, recent simulation and empirical studies have suggested
that increases in gene sampling resulted in better performance than increases in taxon sampling 17-19,
and phylogenetic reconstruction appeared not to be sensitive to highly incomplete taxa as long as a
sufficient number of characters were available 20-23. Another advantage of phylogenomics is that the
increasing throughput capacity of DNA sequencing technology has made available an ever-growing amount
of sequence information, primarily in the form of large collections of expressed sequence tags (ESTs)
or genome sequences. Phylogenetic inferences using a multi-locus approach, especially those based on
ESTs, are widely used because ESTs can produce large numbers of gene sequences relatively easily and
economically and can yield reliable and robust results 24-28. Recently, Hittinger et al. sequenced
transcriptomes of 10 mosquito species using second-generation sequencing technologies and obtained
robust phylogenetic inferences. They claimed this approach was an efficient, data-rich, and economical
option for generating large numbers of orthologous gene alignments for multi-locus phylogeny
inference 29. In view of these results, it is possible that robust phylogeny inferences for
actinopterygians can be obtained by multi-gene approaches using multi-origin expression data.

Actinopterygians are the group of vertebrates with the second-best characterised genomes. Five fully
sequenced, high-quality genomes are available for actinopterygians: Danio rerio (zebrafish),
Gasterosteus aculeatus (three-spined stickleback), Oryzias latipes (Japanese medaka), Takifugu
rubripes (Japanese pufferfish), and Tetraodon nigroviridis (green spotted pufferfish).
Additionally, many EST sequencing projects for a wide variety of 
teleost species have been conducted worldwide, and hundreds of 
thousands of EST sequences are available. However, current deep 
phylogenetic studies of actinopterygians are primarily based on 
mitochondrial genomic data. Studies of this type based on nuclear 
genes are rare, especially in association with large-scale expression 
data. In the present study, the transcriptomes of three basal actinop- 
terygians (Lepisosteus osseus, Polyodon spathula, and Polypterus 
delhezi) and two cypriniforms (Hypophthalmichthys molitrix, 
Hypophthalmichthys nobilis) were sequenced using the second-gen- 
eration sequencing technologies (see Materials and Methods). Based 
on expression data generated in this study and on the results of 
previous genome and EST sequencing projects, we obtained multi- 
locus orthologous gene alignments for 17 of 44 orders within the class 
Actinopterygii. Subsequent analyses were performed to resolve the 
relationships among these species on the basis of these alignments. 

Results 

Sequence analyses and alignment. The transcriptome sequences used in this analysis for three basal
actinopterygians and two cypriniforms were generated de novo in this study (additional information in
supplemental table S1). Transcriptome sequences, ESTs, mRNAs, Unigenes or cDNAs for 21 other species
were downloaded from public databases (see Methods). Based on these multi-origin expression data, we
obtained 274 orthologue groups (OGs) using OrthoSelect. The data profile for each species used in this
study is shown in Table 1. Information for each OG (the number of species, length of alignment,
percentage of missing data, best-fitting models of protein sequence evolution, and accession number
for each sequence) is given in supplemental table S2. The alignment files generated for phylogenetic
analyses are given in supplemental file S1. The distribution of the alignment lengths of the 274 OGs
is shown in Figure 1. The modal value of the alignment lengths appears to be in the range of
200-800 bp, with more than 90% shorter than 900 bp. Only 6 OGs had alignment lengths longer than
1000 bp, and the mean length of all orthologues was 496 bp. There was a bias against obtaining longer
alignments (the majority of the alignment lengths were approximately 500 bp). The reason for this
outcome may be that most of our sequences were obtained directly from expression data rather than
from complete sequencing. The proportions of missing data for our OGs ranged from 10.0% for OG2806 to
62.7% for OG1174. The total number of OGs and percentages of missing data for each species are shown
in Table 1. The missing data within these species ranged from 4.42% (Danio rerio) to 84.86%
(Lepisosteus osseus). The nucleotide supermatrix concatenated from these 274 OGs included 135,969 bp,
and 38.9% of its nucleotide positions were missing overall. The average nucleotide composition of the
concatenated supermatrix sequences was A = 27.1%, C = 24.6%, G = 27.0% and T = 21.3%.

Figure 1 | Distribution of nucleotide alignment lengths of the 274 orthologue groups.
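
The supermatrix figures above can be cross-checked with a few lines of arithmetic. The following is a minimal sketch, not the authors' pipeline: the function names and the toy alignment are hypothetical, and treating only "-" and "?" as missing is an assumption. Only the length of 135,969 bp comes from the text.

```python
def length_without_third_positions(codon_alignment_bp):
    """Columns left after removing every third codon position."""
    if codon_alignment_bp % 3 != 0:
        raise ValueError("expected a codon alignment (length divisible by 3)")
    return codon_alignment_bp * 2 // 3


def missing_fraction(alignment, missing_chars="-?"):
    """Fraction of alignment cells that are gaps or undetermined.

    `alignment` maps a taxon name to its aligned sequence; all sequences
    are assumed to have the same length. Which characters count as
    missing is an assumption of this sketch.
    """
    total = sum(len(seq) for seq in alignment.values())
    missing = sum(seq.count(ch) for seq in alignment.values() for ch in missing_chars)
    return missing / total


# Supermatrix length reported in the text.
print(length_without_third_positions(135969))  # 90646

# Hypothetical toy alignment, not data from the study.
toy = {"taxon_a": "ATGCCC", "taxon_b": "ATG---"}
print(missing_fraction(toy))  # 0.25
```

Removing the third codon positions from 135,969 bp leaves 90,646 bp, which is the length quoted for the supermatrix used in the ML analysis.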

Phylogeny inference based on nuclear multigenes. The concatenated nucleotide dataset (excluding the
third codon positions) and its conceptually translated amino acid dataset were subjected to both
Maximum Likelihood (ML, partitioned and unpartitioned) and Bayesian inference (BI, unpartitioned only)
analyses and produced a consistent topology with similar phylogenetic support values. Almost all nodes
were fully supported by posterior probabilities for BI. For ML, the node uniting the two perciforms,
Dicentrarchus labrax (European seabass) and Sparus aurata (gilthead seabream), as sister groups was
not highly supported by the bootstrap values (Figure 2 and supplemental Figure S1 A-E). Both the AIC
(Akaike information criterion) and the AICc values 30 showed that the likelihood value with the
partitioned supermatrix was better than the value with the unpartitioned supermatrix for the
nucleotides. For the protein sequences, however, the likelihood value with the unpartitioned
supermatrix was better than the value with the partitioned supermatrix. Interestingly, we
reconstructed almost the same topology from the concatenated nucleotide supermatrix including the
third codon positions (supplemental Figure S1 F and G); the only difference was the placement of
Oreochromis niloticus (Nile tilapia). We recovered a monophyletic clade including Gasterosteus
aculeatus (three-spined stickleback), Anoplopoma fimbria (sablefish), Sebastes caurinus (copper
rockfish), Dissostichus mawsoni (Antarctic cod), and Hippoglossus hippoglossus (Atlantic halibut) with
high confidence. Specifically, Gasterosteus aculeatus (Gasterosteiformes) and Anoplopoma fimbria
(Scorpaeniformes) formed a sister-group relationship, and Sebastes caurinus (Scorpaeniformes) and
Dissostichus mawsoni (Perciformes) formed another monophyletic group, with Hippoglossus hippoglossus
(Pleuronectiformes) branching basal to this clade. The order Tetraodontiformes was placed as the most
basal taxon within Percomorpha (except Oreochromis niloticus). Figure 2 also shows that Fundulus
heteroclitus and Oryzias latipes are sister taxa, with Oreochromis niloticus branching basal to this
clade. The monophyly and placement of major taxa such as Teleostei (Elopomorpha + Ostarioclupeomorpha
[= Otocephala] + Euteleostei), Ostarioclupeomorpha (represented by Siluriformes + Cypriniformes), and
Acanthomorpha (Acanthopterygii (Atherinomorpha + Percomorpha) + Paracanthopterygii), which have been
accepted extensively, were supported strongly by our analysis. The clade Protacanthopterygii
((Esociformes + Salmoniformes) + Osmeriformes) was recovered as monophyletic, with the Esociformes
and the Salmoniformes as sister groups. As for the major actinopterygian clades, our results supported
the topology (Polypteriformes, (Acipenseriformes, (Lepisosteiformes + Teleostei))).

Figure 2 | The best-scoring maximum-likelihood (ML) tree derived from the concatenated supermatrix of
the 274 nuclear genes (90,646 bp, excluding the third codon positions) from the 26 actinopterygians
with the GTRGAMMA model implemented in RAxML. Numbers beside internal branches indicate bootstrap
values based on 100 replicates. Other phylogenetic tree reconstruction strategies implemented in this
report all obtained the same topology as this and are shown in supplemental Figure S1.
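
The AIC and AICc comparison between partitioned and unpartitioned analyses follows the standard definitions AIC = 2k - 2 lnL and AICc = AIC + 2k(k + 1)/(n - k - 1), where lower values indicate a better trade-off between fit and parameter count. The sketch below illustrates the comparison. The log-likelihoods and parameter counts are invented for illustration, and using the number of alignment columns as the sample size n is an assumption of this example rather than something stated in the text.

```python
def aic(log_likelihood, n_params):
    """Akaike information criterion: 2k - 2 lnL (lower is better)."""
    return 2.0 * n_params - 2.0 * log_likelihood


def aicc(log_likelihood, n_params, sample_size):
    """AIC with the small-sample correction 2k(k + 1) / (n - k - 1)."""
    denominator = sample_size - n_params - 1
    if denominator <= 0:
        raise ValueError("sample size must exceed n_params + 1")
    correction = 2.0 * n_params * (n_params + 1) / denominator
    return aic(log_likelihood, n_params) + correction


# Hypothetical values: partitioning adds parameters but improves lnL.
n_sites = 5000  # assumed sample size (alignment columns)
unpartitioned = aicc(-1000.0, 10, n_sites)
partitioned = aicc(-950.0, 30, n_sites)

preferred = "partitioned" if partitioned < unpartitioned else "unpartitioned"
print(preferred)  # partitioned
```

With these made-up numbers the gain in log-likelihood outweighs the penalty for the extra parameters; with a smaller gain the unpartitioned model would be preferred, which is the pattern reported above for the protein data.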

Discussion 

The extant basal actinopterygians include four major lineages, the Polypteriformes, Acipenseriformes,
Lepisosteiformes, and Amiiformes. Although their basal positions within the actinopterygians have
been consistently accepted by previous investigators 1, considerable controversy over their
relationship to the teleosts continues 2,4,8. We conducted a comparative analysis of the phylogenetic
positions of three lineages of basal actinopterygians (Polypteriformes, Acipenseriformes, and
Lepisosteiformes) relative to the teleosts with former hypotheses (please refer to Arratia 2001 31,
who presented all possible morphological and molecular hypotheses, and also to Arratia 2004 32). Our
topology was in accordance with a previous conclusion based on gill-arch structure 33, and with the
first published significant hypothesis on the basal actinopterygian relationships based on molecular
data 34. Many recent conclusions based on morphological and molecular data were also consistent with
our topology 35-37. In contrast, previous findings that acipenseriforms or lepisosteiforms are more
closely related to teleosts based on mitogenomic data 8 or molecular synapomorphies 38 were weakly
supported by our topological test (Table 2). Currently, the polypteriforms (e.g., armored bichir) are
widely accepted as the sister group of all other extant actinopterygians 1. However, because the
results we presented here did not include Amia calva in the analysis, this conclusion may be subject
to bias and may require further investigation.

In addition to the basal actinopterygians, all other fishes in this study are collectively included
within the Teleostei (Figure 2), which was represented by three main groups here: Elopomorpha,
Ostarioclupeomorpha (= Otocephala), and Euteleostei. Generally, researchers agree that the
protacanthopterygians occupy a phylogenetic position intermediate between the basal teleosts
(ostarioclupeomorphs and below) and neoteleosts (stomiiforms and above) 9 and are interpreted as basal
Euteleostei. Because many of the morphological characters of the group have a mosaic distribution,
the composition of this assemblage has undergone numerous changes over past decades 1. Additionally,
the deep relationships of the protacanthopterygians are so complex and controversial 1,9 that at
least 10 different phylogenetic hypotheses have been proposed (Figure 3 A-J; note that argentinoids
are not shown because they are absent from our analysis. For more information, see Ishiguro's
figure 1 A-J 9, Springer & Johnson's figure 3 39, and Diogo's figure 2 40). Topological tests strongly
suggested that our placement of the protacanthopterygians and related lineages was correct and
confidently rejected the other dichotomous ones (Table 2). Among these hypotheses, the phylogenetic
position of the esociforms is one of the most controversial 9,41. Our analysis strongly supports the
hypothesis that the sister taxon of the esociforms is the salmoniforms rather than Neoteleostei 39,42
or Osmeriformes 40. This sister-group relationship is in accordance with many morphology-based and
nearly all molecular-based hypotheses. Ramsden et al. corroborated this sister-group relationship
from other perspectives, such as the life history and distribution of the fishes 43. However, the
placements of other lineages in these hypotheses are different from ours. For instance, the placement
of Neoteleostei in our hypothesis is obviously different from the placement in earlier hypotheses
except for that of Rosen 44. Based on his morphological studies, Rosen suggested that
protacanthopterygians were a monophyletic unit and that Protacanthopterygii and Neoteleostei formed a
sister group (Fig. 3A). This hypothesis is the same as ours. However, his placement of ostariophysans
as a sister group to Protacanthopterygii and Neoteleostei was different from ours. Recently, several
hypotheses based on mitochondrial data obtained the same topology as that found by our study. In
fact, in the study of Ishiguro et al., the monophyly of protacanthopterygians cannot be rejected
based on mitogenomic data if alepocephaloids are excluded and monophyly is enforced for the remaining
groups of protacanthopterygians 9. Before that study, almost all morphology-based analyses
consistently treated alepocephaloids and argentinoids, two suborders of the order Argentiniformes, as
sister groups. However, Ishiguro et al.'s mitogenomic phylogenetic analysis argued that
alepocephaloids were nested within the otocephalans with high statistical support 9. Therefore, the
phylogenetic position of these two lineages requires further investigation.

Many taxa within the Euteleostei (minus Protacanthopterygii) that have true spines in the dorsal,
anal, and pelvic fins are included within the Acanthomorpha 1. The superorder Acanthopterygii, which
contains 13 orders, 267 families, 2,422 genera, and approximately 15,000 species, can be divided into
three large assemblages (termed Series, i.e., Mugilomorpha, Atherinomorpha, and Percomorpha), and is
the most species-rich superorder within this taxon 1,45. Although many morphological and molecular
studies have been conducted, the relationships among major lineages within the Acanthomorpha remain
poorly defined 1,6,7,10,11,45-47. In addition, certain orders and families within this assemblage are
not monophyletic, which makes the situation even worse 1. In this study, we intended to test the
possibility of recovering their relationships using many genes rather than to resolve them
thoroughly. The monophyly of the series Atherinomorpha, containing the Atheriniformes, Beloniformes
(including the Adrianichthyoidei), and Cyprinodontiformes, has been consistently suggested 1,48.
Similarly, Japanese medaka (Beloniformes) and killifish (Cyprinodontiformes) were grouped as sister
groups with high confidence in this study. Moreover, we found that one scorpaeniform fish was more
closely related to the Antarctic cod (Perciformes), whereas the other scorpaeniform represented the
sister group of three-spined stickleback (Gasterosteiformes). Certain species within Perciformes
appeared more closely related to the orders Pleuronectiformes, Scorpaeniformes, and
Gasterosteiformes, but another species (Oreochromis niloticus) was more closely related to
Atherinomorpha. This result is consistent with previous studies that proposed that Scorpaeniformes
and Perciformes may not be monophyletic 1,45,49. Interestingly, in a previous study based on
mitogenomic sequences, Miya et al. found that internal branches among Percomorpha were only weakly
supported but that members of Gasterosteiformes and Scorpaeniformes formed a strongly supported
monophyletic group with a bootstrap value of 100% 46. Moreover, the affinity of the cichlids with
members of the Atherinomorpha has been consistently supported by studies based on nuclear
genes 17,50-52 and mitochondrial genomes 35,37,48,53. This phylogenetic affinity is also supported by
a unique egg morphology and spawning mode 48. We recovered the tetraodontiforms as pre-perciforms
with high confidence (Fig. 2). This result was in accordance with Springer and Johnson's finding,
which was based on morphological studies 39. However, evidence suggests that Scorpaeniformes
(including the Dactylopteridae), Pleuronectiformes, and Tetraodontiformes were most likely
derivatives of perciform lineages 1. Accordingly, our placement of the tetraodontiforms may be an
artifact resulting from sparse taxonomic sampling of those species. Our multi-gene analysis recovered
the relationships among most of these lineages. Nevertheless, many questions regarding the
relationships among lineages within Acanthomorpha remain unanswered. For example, the monophyly of
the Paracanthopterygii, the sister group of Atherinomorpha and Tetraodontiformes, the phylogenetic
placement of Batrachoidiformes, and the relationships among lineages within Percomorpha have long
been controversial 1. The last-named question poses particular difficulties because the monophyly of
these groups is questionable and phylogenetic conclusions will depend on the choice of
representatives 50.

Figure 3 | Ten alternative phylogenetic hypotheses for basal euteleosts published after Rosen (1974).
A-H were modified from Ishiguro et al. (2003), I was modified from Diogo (2008), and J was modified
from Fu (2010) and Broughton (2010). All terminal taxa were standardised to the three major
protacanthopterygian lineages analysed in the present study (indicated by bold face).

The deep phylogeny of actinopterygians is a long-standing and complex problem in the study of fish
evolution. In this study, our taxon sampling for basal actinopterygians was purposefully chosen, but
the information used for teleosts was based primarily on expression data available in public
databases. We showed that phylogenomics based on integrating multi-origin expression data can recover
their phylogeny with high confidence and that the major topology we obtained is consistent with that
found by most previous studies. Moreover, the question of missing data is a significant problem for
large-scale phylogenomic analysis. Philippe et al. showed that a supermatrix alignment with 25%
missing data can still confidently resolve the phylogeny of eukaryotes 21. In the case of
actinopterygian phylogeny, an alignment with 38.9% missing data can result in a correct topology with
high support. These results suggest that even with insufficient taxon sampling and several data gaps,
large-scale phylogenomics based on integrating multi-origin expression data can produce a relatively
good resolution of the deep phylogeny of actinopterygians. Further investigations based on more
purposefully chosen species may completely reconstruct the relationships of actinopterygians and
provide a reliable phylogenetic framework for studying actinopterygian evolution.

Table 2 | Results from AU tests and SH tests among alternative tree topologies derived from analysis
of the nucleotide supermatrix of 274 OGs

Tree a                                                  lnL          Diff -lnL   P b       P c
((((((Neo,((Sal,Eso),Osm)),Ost),Elo),Lep),Aci),Pol)     -244294.24   best
((((((Neo,((Sal,Eso),Osm)),Ost),Elo),Aci),Lep),Pol)     -244303.22   9.0         0.035*    0.048*
(((((Neo,((Sal,Eso),Osm)),Ost),Elo),(Aci,Lep)),Pol)     -244302.48   8.2         0.089     0.078
(((((((Neo,Sal),(Eso,Osm)),Ost),Elo),Aci),Lep),Pol)     -245369.75   1075.5      3e-09*    0*
(((((((Neo,Osm),(Eso,Sal)),Ost),Elo),Aci),Lep),Pol)     -244407.86   113.6       2e-04*    0.048*
(((((((Neo,Eso),(Osm,Sal)),Ost),Elo),Aci),Lep),Pol)     -245293.64   999.4       2e-54*    0*
(((((((Neo,(Osm,Sal)),Eso),Ost),Elo),Aci),Lep),Pol)     -245295.33   1001.1      4e-54*    0*

a Lep: Lepisosteiformes; Aci: Acipenseriformes; Pol: Polypteriformes; Elo: Elopiformes; Eso:
Esociformes; Osm: Osmeriformes; Sal: Salmoniformes; Neo: Neoteleostei; Ost: Ostariophysi.
b Statistically significant differences (< 0.05) denoted by asterisks, AU test.
c Statistically significant differences (< 0.05) denoted by asterisks, SH test.
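
The "Diff -lnL" column of Table 2 is the difference between the best log-likelihood and that of each alternative topology. As a quick consistency check, the sketch below recomputes that column from the lnL values listed in the table. The function name is hypothetical, and rounding to one decimal place is assumed from the way the table reports the differences.

```python
# lnL values as listed in Table 2, in row order (best tree first).
TABLE2_LNL = [
    -244294.24,
    -244303.22,
    -244302.48,
    -245369.75,
    -244407.86,
    -245293.64,
    -245295.33,
]


def lnl_differences(log_likelihoods, ndigits=1):
    """Difference between the best (highest) lnL and every tree's lnL."""
    best = max(log_likelihoods)
    return [round(best - lnl, ndigits) for lnl in log_likelihoods]


print(lnl_differences(TABLE2_LNL))
# [0.0, 9.0, 8.2, 1075.5, 113.6, 999.4, 1001.1]
```

The recomputed values agree with the tabulated differences, with the best tree at zero by construction.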

Methods 

Data collection and processing. Transcriptome sequences of five ray-finned fish 
species, Hypophthalmichthys molitrix (silver carp), Hypophthalmichthys nobilis 
(bighead carp), Lepisosteus osseus (longnose gar), Polyodon spathula (spoonbill cat), 
and the outgroup, Polypterus delhezi (armored bichir) were originally generated by 
Solexa sequencing in this study. Specimens of these species were purchased from a 
commercial source. The total RNA of each species was extracted from pooled organs 
with Trizol (Invitrogen, Carlsbad, CA, USA) according to the manufacturer's 
instructions. Poly (A+) RNA isolation, cDNA synthesis, preparation, sequencing (on 
an Illumina Genome Analyzer), and assembly (using the SOAP software package 54 ) 
were performed at Beijing Genomics Institute. The assembled transcriptome 
sequences of European eel (Anguilla anguilla) were downloaded from EeelBase 
(http://compgen.bio.unipd.it/eeelbase/). 

ESTs and/or mRNAs of Anoplopoma fimbria (sablefish), Dicentrarchus labrax 
(European seabass), Dissostichus mawsoni (Antarctic cod), Esox lucius (Northern 
pike), Hippoglossus hippoglossus (Atlantic halibut), Osmerus mordax (rainbow smelt), 
Sebastes caurinus (copper rockfish) and Sparus aurata (gilthead seabream), were 
downloaded from the National Center for Biotechnology Information 
(www.ncbi.nlm.nih.gov, GenBank status on 23 Dec 2009). Unigenes for Fundulus 
heteroclitus (killifish), Gadus morhua (Atlantic cod), Ictalurus furcatus (blue catfish), 
Ictalurus punctatus (channel catfish), Oreochromis niloticus (Nile tilapia), Pimephales 
promelas (fathead minnow) and Salmo salar (Atlantic salmon) were also downloaded 
from this database (GenBank status on 23 Dec 2009). Various contaminants and low-quality
and low-complexity sequences within these data were screened and trimmed
using SeqClean (http://compbio.dfci.harvard.edu/tgi/software) with NCBI's UniVec
as a screening file.

Complementary DNA sequences of five model fish species, Danio rerio (zebrafish), 
Gasterosteus aculeatus (three-spined stickleback), Oryzias latipes (Japanese medaka), 
Takifugu rubripes (Japanese pufferfish), and Tetraodon nigroviridis (green spotted 
puffer), were retrieved from Ensembl (http://www.ensembl.org/, RELEASE62). 

Sequence selection and alignment. Orthologue assignments were achieved using a slightly modified
OrthoSelect method 55. The default reference databases of OrthoSelect are KOG (clusters of
euKaryotic Orthologous Groups) and OrthoMCL, which include non-fish species. Teleosts have
experienced the fish-specific genome duplication, which may result in "one2two" or "one2many"
orthology relationships between teleosts and other species. To overcome this problem and to identify
the orthology relationships unambiguously, we used "one2one" orthology relationships as references.
We therefore downloaded amino acid sequences of five model fish and their "one2one" relationships
from Ensembl using BioMart. Each of these "one2one" sequence sets was termed an orthologue group (OG)
in this study, and the expression data were assigned to these OGs by a BLASTX analysis of individual
EST sequences against all OG proteins. After the OG assignment, each sequence was translated using
ESTScan 56, GeneWise 57, and a standard six-frame translation using BioPerl, and aligned to the best
hit from the previous BLAST search using bl2seq 58. The translated sequence with the lowest E-value
was chosen as the correctly translated sequence. Subsequently, one sequence from each organism was
selected to represent the most probable orthologue, in accordance with the OrthoSelect strategy based
on matching positions normalized by sequence length in pairwise comparisons with MUSCLE 59. However,
because many ESTs were of low quality and included frameshift errors or premature stop codons, and
because of the limitations of bl2seq, the true orthologue may be discarded in some species. To
overcome these problems, we translated the expression data into protein sequences using ESTScan, and
found the best sequence from each database using hmmbuild and hmmsearch from the HMMER package 60.
After HMM selection, we obtained the orthology relationships for each OG. Then, we chose a model fish
sequence, translated it into a protein sequence, and compared it to its orthologues separately with
GeneWise (only orthologues with a score of more than 100 were retained). A customized Perl script was
then used to extract matched nucleotides and to generate a sequence alignment for each OG. If a
sequence was assigned to more than one OG, we discarded all these OGs to avoid any ambiguity. The OG
alignments having more than 14 sequences were visually inspected and adjusted by hand using BioEdit
(http://www.mbio.ncsu.edu/BioEdit/bioedit.html). Finally, 274 OGs were selected and used for
subsequent analyses.
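
Two of the selection rules described above lend themselves to a short illustration: keeping the translation with the lowest E-value, and discarding every OG that shares a sequence with another OG. The sketch below is a simplified stand-in, not the modified OrthoSelect code or the customized Perl script; the function names, identifiers and E-values are all hypothetical.

```python
from collections import defaultdict


def best_translation(candidates):
    """Pick the (method, e_value) pair with the lowest E-value."""
    return min(candidates, key=lambda pair: pair[1])


def drop_ambiguous_ogs(assignments):
    """Remove every OG that shares at least one sequence with another OG.

    `assignments` is an iterable of (sequence_id, og_id) pairs. The result
    maps each remaining OG to a sorted list of its sequence IDs.
    """
    ogs_by_sequence = defaultdict(set)
    sequences_by_og = defaultdict(set)
    for sequence_id, og_id in assignments:
        ogs_by_sequence[sequence_id].add(og_id)
        sequences_by_og[og_id].add(sequence_id)

    ambiguous = set()
    for ogs in ogs_by_sequence.values():
        if len(ogs) > 1:  # one sequence assigned to several OGs
            ambiguous |= ogs

    return {
        og_id: sorted(sequences)
        for og_id, sequences in sequences_by_og.items()
        if og_id not in ambiguous
    }


# Hypothetical inputs for illustration only.
translations = [("ESTScan", 1e-20), ("GeneWise", 1e-35), ("six-frame", 1e-5)]
print(best_translation(translations)[0])  # GeneWise

pairs = [("s1", "OG1"), ("s2", "OG1"), ("s2", "OG2"), ("s3", "OG3")]
print(drop_ambiguous_ogs(pairs))  # {'OG3': ['s3']}
```

In the toy input, sequence s2 is assigned to both OG1 and OG2, so both groups are dropped and only OG3 survives, mirroring the conservative rule stated in the text.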

Phylogenetic analysis. The nucleotide alignments (excluding the third codon positions) and the
conceptually translated amino acid alignments of these OGs were each concatenated. Both supermatrices
were subjected to Bayesian inference (BI) and Maximum Likelihood (ML) analyses. BI was performed with
the MPI version of MrBayes 3.1.2 61, in which Markov Chain Monte Carlo (MCMC) calculations were
spread across multiple CPUs and run on parallel computing architectures. The analysis was initiated
from a random starting tree. Two runs with twelve chains of MCMC iterations were performed for
5 million generations (sampling trees every 100 generations) with the GTR + I + Γ model of sequence
evolution (for protein sequences in MrBayes, we used mixed + I + Γ), and the first 20,000 trees
(2 million generations) were discarded as burn-in. The average standard deviation of the split
frequencies of the MCMC runs was used as the convergence diagnostic. The 50% majority-rule consensus
tree was determined to calculate the posterior probabilities for each node. A parallel version of
RAxML 7.2.6 62 was used for constructing Maximum Likelihood (ML) trees with the GTRGAMMA model for
both the partitioned and the unpartitioned supermatrices (for the unpartitioned protein supermatrix,
we used the PROTGAMMAJTTF model; the best-fitting models of protein sequence evolution for each OG
are listed in supplemental table S2). The partitioned supermatrices allow RAxML to assign different
parameters to each gene. One hundred replicates for rapid bootstrap analyses 62 were also performed
with RAxML, and a 50% majority-rule consensus was calculated to determine the support values for each
node. Finally, we placed the root on the branch leading to Polypterus using MEGA5 63. The
best-fitting models of protein sequence evolution were selected by ProtTest2.4 64. Tests of
alternative phylogenetic hypotheses were implemented in CONSEL 65.
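
The data preparation and burn-in settings described above can be sketched in a few lines. This is an illustrative outline only, not the scripts used in the study: the function names are hypothetical, each alignment is assumed to be in frame and to start at the first codon position, and taxa absent from an OG are padded with "?" as an assumed missing-data symbol.

```python
def strip_third_positions(sequence):
    """Drop every third base (codon position 3) from an in-frame sequence."""
    return "".join(base for i, base in enumerate(sequence) if i % 3 != 2)


def concatenate(alignments, taxa, missing="?"):
    """Build a supermatrix, padding taxa that are absent from an alignment.

    `alignments` is a list of non-empty dicts mapping taxon -> aligned
    sequence; sequences within one alignment share the same length.
    """
    parts = {taxon: [] for taxon in taxa}
    for alignment in alignments:
        length = len(next(iter(alignment.values())))
        for taxon in taxa:
            parts[taxon].append(alignment.get(taxon, missing * length))
    return {taxon: "".join(chunks) for taxon, chunks in parts.items()}


def burnin_samples(burnin_generations, sample_every):
    """Number of sampled trees discarded for a given burn-in length."""
    return burnin_generations // sample_every


# Hypothetical toy data for illustration only.
print(strip_third_positions("ATGCCA"))  # ATCC
print(concatenate([{"a": "AT", "b": "GC"}, {"a": "GG"}], ["a", "b"]))
# {'a': 'ATGG', 'b': 'GC??'}

# Settings from the text: 2 million generations, sampled every 100.
print(burnin_samples(2_000_000, 100))  # 20000
```

The last line reproduces the burn-in stated in the text: discarding 2 million generations at a sampling interval of 100 removes the first 20,000 trees of each run.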